\( \newcommand{\water}{{\rm H_{2}O}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\E}{\mathbb{E}} \newcommand{\d}{\mathop{}\!\mathrm{d}} \newcommand{\grad}{\nabla} \newcommand{\T}{^\text{T}} \newcommand{\mathbbone}{\unicode{x1D7D9}} \renewcommand{\:}{\enspace} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\Tr}{Tr} \newcommand{\norm}[1]{\lVert #1\rVert} \newcommand{\KL}[2]{ \text{KL}\left(\left.\rule{0pt}{10pt} #1 \; \right\| \; #2 \right) } \newcommand{\slashfrac}[2]{\left.#1\middle/#2\right.} \)

Reparameterization trick: A Monte Carlo gradient estimator

Problem

Suppose you have to compute the gradient of an expectation:

\[ \nabla_\phi \; \mathbb{E}_{q_\phi(z)} \big[ f(z) \big] \]

where, problematically, the gradient \(\nabla_\phi\) is with respect to the parameters of the distribution \(q_\phi(z)\), so we cannot simply move the gradient inside the expectation.

Idea for solution

The reparameterization trick replaces the expectation over the original distribution \(\; q_\phi(z) \;\), which depends on \(\; \phi \;\), with an expectation over a different distribution \(\; p(\varepsilon) \;\) that is independent of \(\; \phi \;\). This way, we can move the gradient inside the expectation, turning the difficult gradient into an expectation that is simple to approximate.

Solution

Instead of sampling \(z_i \sim q_\phi(z)\), sample \(\; \varepsilon_i \sim p(\varepsilon) \;\) and transform them into pseudosamples \(\; \tilde{z}_i \;\) that have the same distribution as \(\; z_i \;\). Transform \(\; \varepsilon_i \;\) into \(\; \tilde{z}_i \;\) with the function \(\tilde{z}_i = g(\phi,\varepsilon_i)\). The function \(g(\phi,\varepsilon)\) and the distribution \(p(\varepsilon)\) are designed so that \(\tilde{z}_i \sim q_\phi(z)\), even though we never sample from \(q_\phi(z)\) directly.

Thus, we reach the following equivalence:

\[ \mathbb{E}_{q_\phi(z)} \big[ f(z) \big] = \mathbb{E}_{p(\varepsilon)}\big[ f(\tilde{z}) \big] \]
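As a concrete sanity check, here is a minimal numpy sketch (the Gaussian \(q_\phi(z) = N(\mu, \sigma^2)\) with \(g(\phi,\varepsilon) = \mu + \sigma\varepsilon\) and the integrand \(f(z) = z^2\) are my own choices for illustration) that estimates both sides of the equivalence by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8          # parameters phi = (mu, sigma), chosen arbitrarily
f = lambda z: z ** 2          # example integrand
n = 100_000

# Left-hand side: sample z directly from q_phi(z) = N(mu, sigma^2)
z = rng.normal(mu, sigma, size=n)
lhs = f(z).mean()

# Right-hand side: sample eps from p(eps) = N(0, 1) and push it through g
eps = rng.normal(0.0, 1.0, size=n)
z_tilde = mu + sigma * eps    # g(phi, eps)
rhs = f(z_tilde).mean()

print(lhs, rhs)               # both approach E[z^2] = mu^2 + sigma^2 = 2.89
```

Both estimates converge to \(\mathbb{E}[z^2] = \mu^2 + \sigma^2\), even though the right-hand side never samples from \(q_\phi(z)\) directly.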

And the gradient is:

\begin{align} \nabla_\phi \, \mathbb{E}_{q_\phi(z)} \big[ f(z) \big] & = \nabla_\phi \, \mathbb{E}_{p(\varepsilon)}\bigg[\, f\big(\,g(\phi,\varepsilon)\,\big) \bigg] \\[12pt] & = \mathbb{E}_{ p(\varepsilon)}\bigg[\, \nabla_\phi \, f\big(\,g(\phi,\varepsilon)\,\big) \bigg] \\[12pt] & = \; \mathbb{E}_{p(\varepsilon)}\bigg[ \, \nabla_{g(\phi,\varepsilon) }\,f(g(\phi,\varepsilon)) \, \nabla_\phi\, g(\phi,\varepsilon) \, \bigg] \end{align}
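As a quick check of this chain rule, take the same illustrative choices as above: \(q_\phi(z) = N(\mu, \sigma^2)\) with \(\phi = (\mu, \sigma)\), \(g(\phi,\varepsilon) = \mu + \sigma\varepsilon\), and \(f(z) = z^2\):

\begin{align} \nabla_\mu \, \mathbb{E}_{p(\varepsilon)}\big[ (\mu + \sigma\varepsilon)^2 \big] &= \mathbb{E}_{p(\varepsilon)}\big[\, 2(\mu + \sigma\varepsilon) \cdot 1 \,\big] = 2\mu \\[8pt] \nabla_\sigma \, \mathbb{E}_{p(\varepsilon)}\big[ (\mu + \sigma\varepsilon)^2 \big] &= \mathbb{E}_{p(\varepsilon)}\big[\, 2(\mu + \sigma\varepsilon) \cdot \varepsilon \,\big] = 2\sigma \end{align}

which agrees with differentiating the closed form \(\mathbb{E}_{q_\phi(z)}[z^2] = \mu^2 + \sigma^2\) directly.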

Now, you can estimate this by Monte Carlo approximation: sample \(\varepsilon_i \sim p(\varepsilon), \; i = 1, \ldots, n\), and compute the mean

\[ \nabla_\phi \, \mathbb{E}_{q_\phi(z)} \big[ f(z) \big] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_{g(\phi,\varepsilon_i)} f\big(g(\phi,\varepsilon_i)\big) \, \nabla_\phi \, g(\phi,\varepsilon_i) \]
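Continuing the same illustrative Gaussian example (my choice, not part of the general method), here is a minimal numpy sketch of this estimator with the gradients written out by hand:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                 # phi = (mu, sigma)
n = 100_000

eps = rng.normal(0.0, 1.0, size=n)   # eps_i ~ p(eps) = N(0, 1)
z = mu + sigma * eps                 # g(phi, eps_i)

df_dz = 2.0 * z                      # grad_g f(g) for f(z) = z^2
dz_dmu = np.ones_like(eps)           # grad_mu g(phi, eps) = 1
dz_dsigma = eps                      # grad_sigma g(phi, eps) = eps

grad_mu = np.mean(df_dz * dz_dmu)        # ~ 2*mu    = 3.0
grad_sigma = np.mean(df_dz * dz_dsigma)  # ~ 2*sigma = 1.6
print(grad_mu, grad_sigma)
```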

Performance

The reparameterization trick usually produces gradient estimates with lower variance than those of the score function estimator, because it exploits the gradient of \(f\) itself rather than only its value.
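To make this concrete on the same toy problem, the sketch below compares the empirical per-sample variance of the reparameterization estimator with that of the score function estimator \(\mathbb{E}_{q_\phi(z)}\big[ f(z) \, \nabla_\mu \log q_\phi(z) \big]\) for \(\nabla_\mu \, \mathbb{E}[z^2]\); the exact numbers depend on the problem, so treat it as an illustration rather than a general result:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
n = 100_000

# Reparameterization (pathwise) estimator of grad_mu E[z^2]
eps = rng.normal(0.0, 1.0, size=n)
z = mu + sigma * eps
reparam = 2.0 * z * 1.0                    # per-sample estimates, mean ~ 2*mu

# Score function estimator: f(z) * grad_mu log q_phi(z), with grad_mu log q = (z - mu) / sigma^2
z2 = rng.normal(mu, sigma, size=n)
score = z2 ** 2 * (z2 - mu) / sigma ** 2   # per-sample estimates, mean ~ 2*mu

print(reparam.mean(), reparam.var())  # unbiased, relatively low variance
print(score.mean(), score.var())      # unbiased, typically much higher variance
```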

Example

(1D example for simplicity)

In VAEs, each latent variable \(\; z_i \;\) is sampled from a normal with mean \(\; \mu_\phi(x) \;\) and variance \(\; \sigma^2_\phi(x) \;\). Remember that \(\; \mu_\phi \;\) and \(\;\sigma^2_\phi \;\) are outputs of the encoder, and \(\; \phi \;\) are the parameters of the encoder.

VAEs use the following reparameterization:

\begin{align} z &\sim N\big(\, z \;\big|\; \mu_\phi(x),\; \sigma_\phi^2(x) \,\big) \\[8pt] &\Updownarrow \\[8pt] z &= g(\phi,\varepsilon) := \mu_\phi(x) + \sigma_\phi(x)\, \varepsilon, \quad \varepsilon \sim N(0,1) \end{align}
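In code, this usually shows up as a small sampling step between the encoder and the decoder. A minimal PyTorch-style sketch (the function and variable names, and the convention that the encoder outputs \(\mu_\phi(x)\) and \(\log \sigma^2_\phi(x)\), are assumptions for illustration):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), differentiably in mu and logvar."""
    std = torch.exp(0.5 * logvar)     # sigma_phi(x)
    eps = torch.randn_like(std)       # eps ~ N(0, 1), carries no dependence on phi
    return mu + std * eps             # gradients flow through mu and std, not eps

# Hypothetical usage inside a VAE forward pass:
# mu, logvar = encoder(x)             # outputs of the encoder with parameters phi
# z = reparameterize(mu, logvar)
# x_hat = decoder(z)
```

Because the randomness comes only from \(\varepsilon\), backpropagation can pass through \(\mu_\phi(x)\) and \(\sigma_\phi(x)\) into the encoder parameters \(\phi\).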